The aim of this assignment is to fit and evaluate a prediction model that qualitatively assesses the movement patterns of biceps curls. Sensor data from four inertial measurement units sampling at 45 Hz is provided. Each unit records three-axis acceleration, gyroscope and magnetometer data. The training data consists of measurements of six participants performing one set of 10 repetitions in five different movement patterns:
The complete R-Code can be found in the Appendix.
setwd("~/Desktop/PracticalMachineLearning")
data <- read.csv("pml-training.csv", stringsAsFactors = FALSE) # read the data into R
data$classe <- factor(data$classe) # make outcome variable a factor
In the original report, Velloso et al. took overlapping 2.5-second windows from the time-series data and calculated summary statistics (kurtosis, skewness, maximum, minimum, amplitude, variance, mean and standard deviation) to use as features for their prediction model.
Since the testing data provided for this assignment consists only of single data entries (snapshots), a time-series approach is not possible. The summary statistics calculated in the training data set can therefore be discarded. The timestamp variables are not synchronised to the beginning or end of the movement, and no information about timing within an exercise can be gained from a single timestamp; they will be discarded as well.
# Create subset of possible features
df <- data %>%
select(-grep("kurtosis|skewness|max|min|amplitude|var|avg|stddev",names(data), value =F)) %>%
select(-new_window, -cvtd_timestamp, -user_name, -X,
-raw_timestamp_part_1,-raw_timestamp_part_2, -num_window)
The 52 remaining variables are the possible prediction features. Before any further analysis, the model-building data set is partitioned into a training and a test set.
#Partitioning the data set
set.seed(1984)
inTrain <- createDataPartition(df$classe, p=0.6, list = FALSE)
training <- df[inTrain,]
testing <- df[-inTrain,]
| | Training | Testing |
|---|---|---|
| Observations | 11776 | 7846 |
| Variables | 53 | 53 |

Table 1: Partitioning Dimensions
To see whether any variable is a scalar multiple of another variable, or a linear combination of several other variables, and can therefore be left out of the modelling process, the training data is tested for linear dependencies.
# Detect linear combinations
trainMatrix <- as.matrix(training[,-53])
findLinearCombos(trainMatrix)$linearCombos
NULL
There are no linear dependencies in the training data.
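caret's `findLinearCombos()` rests on a QR decomposition; the underlying idea can be illustrated on a toy matrix (synthetic data, not the assignment's variables):

```r
# A matrix whose third column is the sum of the first two is rank-deficient,
# and qr() exposes that: the rank is smaller than the number of columns.
set.seed(1)
m <- cbind(x1 = rnorm(10), x2 = rnorm(10))
m <- cbind(m, x3 = m[, "x1"] + m[, "x2"])  # x3 is an exact linear combination

qr(m)$rank              # 2, although the matrix has 3 columns
qr(m)$rank < ncol(m)    # TRUE -> a linear dependency is present
```

On the training matrix above the rank equals the column count, which is why `findLinearCombos()` reports nothing.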
To assess the correlation within the training data, the variables that most reduce the pair-wise correlation when removed are identified:
# Detect variables responsible for high pair-wise correlation
corMatrix <- cor(training[,-53]) # calculate correlation matrix
highCors <- findCorrelation(corMatrix, cutoff = .9, # find variables responsible for
names = TRUE, exact = TRUE) # high pair-wise correlation
| | Variable Names |
|---|---|
| 1 | accel_belt_z |
| 2 | roll_belt |
| 3 | accel_belt_y |
| 4 | accel_belt_x |
| 5 | pitch_belt |
| 6 | gyros_dumbbell_x |
| 7 | gyros_dumbbell_z |
| 8 | gyros_arm_x |

If those variables are removed, the pair-wise correlation drops below 0.9.
A broader overview can be gained from a cross-correlation heatmap.

Plot 1: Cross-Correlation Heatmap
Some areas of higher correlation, especially among variables originating from the same sensor (see the belt variables in the bottom left), can be observed. If a modelling approach is chosen that is sensitive to high collinearity, the pair-wise correlation in the data should be reduced.
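What `findCorrelation()` flags can be mimicked in base R by listing the variable pairs whose absolute correlation exceeds the cutoff; a sketch on synthetic data (not the sensor variables):

```r
# Build a small data frame where b is a noisy copy of a, so cor(a, b) is high,
# then list every pair whose absolute correlation exceeds 0.9.
set.seed(1)
a <- rnorm(100)
d <- data.frame(a = a,
                b = a + rnorm(100, sd = 0.1),  # near-duplicate of a
                c = rnorm(100))                # independent column
cm <- abs(cor(d))
cm[upper.tri(cm, diag = TRUE)] <- 0            # keep each pair only once
which(cm > 0.9, arr.ind = TRUE)                # row/col indices of correlated pairs
```

caret's `findCorrelation()` goes one step further and chooses which member of each flagged pair to drop based on its mean correlation with all other variables.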
Variables that vary only marginally across the observations do not add information to the model and can be removed.
nearZeroVar(training) # testing for near zero variance
integer(0)
There are no variables with near zero variance in the training set.
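caret's `nearZeroVar()` actually scores the frequency ratio and the percentage of unique values; a cruder, variance-based sketch on synthetic columns conveys the intent:

```r
# Flag columns whose variance is (almost) zero: they cannot help discriminate
# between classes. Synthetic example, not the assignment data.
d <- data.frame(constant = rep(1, 50),           # no variation at all
                almost   = c(rep(0, 49), 1e-9),  # numerically negligible variation
                varying  = seq_len(50))          # informative column
apply(d, 2, var)
which(apply(d, 2, var) < 1e-8)   # columns carrying (next to) no information
```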
Histograms of the normalised variables show their distributions.

Plot 2: Histograms of all 52 possible Features
Many of the variables are not normally distributed, with no obvious transformation to alleviate the problem. The choice of modelling algorithm should reflect this caveat. For further analysis see the violin-jitter plots in the appendix.
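The normalisation behind the histograms (and the violin-jitter plots) is a plain z-score via `scale()`; every column ends up with mean 0 and standard deviation 1:

```r
# z-score normalisation: subtract the column mean, divide by the column
# standard deviation. Shown on a small vector for clarity.
x <- c(10, 12, 14, 18, 26)
z <- scale(x)
mean(z)              # 0 (up to floating point)
sd(as.numeric(z))    # exactly 1 by construction
```

Normalisation changes location and spread only, so the skewness visible in the histograms survives it; that is why the shape of the distributions, not the scaling, drives the choice of algorithm.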
To model this data, an algorithm is needed that can handle multinomial classification on a data set affected by some multicollinearity and containing many variables that are clearly not normally distributed. In this setting the random forest is a robust algorithm.
In tuning the model, three error metrics will be considered: the out-of-bag (OOB) error, the cross-validation (CV) error and the out-of-sample (test set) error.
The strategy follows a three-partition approach: the training data will be split into a training and a testing data set, and the 20 test cases will serve as validation data.
To find the right tuning parameters, four models with four different numbers of trees in the forest (200, 500, 1000, 2000) will be trained and evaluated. After the best-fitting model is chosen, three different mtry values (the number of randomly sampled variables that can be split upon at each node) will be tested: the square root of the number of variables, its double and its half. If the results show that the OOB and CV error rates are good approximations of the real out-of-sample error, the final model will be trained on the whole data set with the evaluated tuning parameters and validated on the 20 test cases.
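The three mtry candidates can be computed up front. Note that `randomForest()` is assumed here to truncate a non-integer mtry internally, so the effective values are the floored ones:

```r
# The mtry grid evaluated below: half, default (square root) and double.
# floor() makes the effective integer values explicit.
p <- 52                                  # number of candidate features
mtryGrid <- floor(c(half    = sqrt(p) / 2,
                    default = sqrt(p),
                    double  = sqrt(p) * 2))
mtryGrid                                 # 3, 7, 14
```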
#Fitting 4 RF Models
set.seed(1984)
mod1 <- randomForest(y = training$classe, x = training[,-53], ntree = 200, mtry = sqrt(52),
do.trace = TRUE, xtest = testing[,-53], ytest = testing$classe, keep.forest = T)
mod2 <- randomForest(y = training$classe, x = training[,-53], ntree = 500, mtry = sqrt(52),
do.trace = TRUE, xtest = testing[,-53], ytest = testing$classe, keep.forest = T)
mod3 <- randomForest(y = training$classe, x = training[,-53], ntree = 1000, mtry = sqrt(52),
do.trace = TRUE, xtest = testing[,-53], ytest = testing$classe, keep.forest = T)
mod4 <- randomForest(y = training$classe, x = training[,-53], ntree = 2000, mtry = sqrt(52),
do.trace = TRUE, xtest = testing[,-53], ytest = testing$classe, keep.forest = T)
# Cross-validating those 4 models
set.seed(1984)
mod1CV <- rf.crossValidation(mod1, xdata = training[,-53],
p = 0.05, n = 20, plot = F, do.trace = T)
mod2CV <- rf.crossValidation(mod2, xdata = training[,-53],
p = 0.05, n = 20, plot = F, do.trace = T)
mod3CV <- rf.crossValidation(mod3, xdata = training[,-53],
p = 0.05, n = 20, plot = F, do.trace = T)
mod4CV <- rf.crossValidation(mod4, xdata = training[,-53],
p = 0.05, n = 20, plot = F, do.trace = T)
The following plot compares the OOB, cross-validation and out-of-sample (test set) error rates for the four models.
Plot 3: Cross-Validation, Out-Of-Bag and Out-Of-Sample Error
The cross-validation error plateaus early, while the error rate on the testing set improves with higher tree numbers. The accuracy is already very high for all models. The confusion matrices for these models show only small differences and can be seen in the appendix. The plotted prediction error by outcome class shows no systematic difference between the four models (see appendix). The stable CV error and the reduced test set error point to a higher ntree number (2000) for the final model.
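The reason the OOB error comes for free: each bootstrap sample used to grow a tree leaves out roughly a third of the observations, and those left-out rows act as that tree's built-in test set. A quick simulation of the left-out fraction:

```r
# Fraction of observations missing from one bootstrap sample; for large n
# this converges to exp(-1), i.e. about 36.8% of rows are "out of bag".
set.seed(1)
n <- 10000
boot <- sample(n, replace = TRUE)     # one bootstrap sample of row indices
mean(!seq_len(n) %in% boot)           # close to exp(-1) = 0.3679
```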
Model 4 was run with the default mtry = floor(sqrt(number of variables)) = 7; now its double and its half will be trained and evaluated.
# Train two more models using different mtry values
set.seed(1984)
mod4double <- randomForest(y = training$classe, x = training[,-53], ntree = 2000,
mtry = sqrt(52)*2, do.trace = TRUE, xtest = testing[,-53],
ytest = testing$classe, keep.forest = T)
mod4half <- randomForest(y = training$classe, x = training[,-53], ntree = 2000,
mtry = sqrt(52)*0.5, do.trace = TRUE, xtest = testing[,-53],
ytest = testing$classe, keep.forest = T)
# Cross validate those two models
set.seed(1984)
mod4CVdouble <- rf.crossValidation(mod4double, xdata = training[,-53],
p = 0.05, n = 20, plot = F, do.trace = T)
mod4CVhalf <- rf.crossValidation(mod4half, xdata = training[,-53],
p = 0.05, n = 20, plot = F, do.trace = T)
Plot 4: Cross-Validation, Out-Of-Bag and Out-Of-Sample Error
The original value for mtry seems to produce the best model fit. The OOB error estimate decreases marginally (though the confidence intervals overlap), but the CV error and the test set error increase with the other values.
Since in all models so far the OOB and CV error rate estimates seem to be good predictors of the out-of-sample error rate, the final model will use the whole training data set and the evaluated tuning parameters (ntree = 2000, mtry = sqrt(52)).
# Train the final model
set.seed(1984)
finalModel <- randomForest(y = df$classe, x = df[,-53], ntree = 2000, mtry = sqrt(52),
do.trace = TRUE, keep.forest = T)
# Cross validate the final model
finalModelCV <- rf.crossValidation(finalModel, xdata = df[,-53],
p = 0.05, n = 50, plot = F, do.trace = T)
CV and OOB error rates show a further improvement compared to model 4. The OOB confusion matrix and the 10 most important feature variables by mean decrease in Gini index are shown below.
Plot 5: Error Rates, OOB Confusion Matrix and Variable Importance Plot - Final Model
The following two tables show the statistical details of the final model. The OOB estimated error rate is below 0.3%!
| Final Model | |
|---|---|
| Accuracy | 0.9971 |
| Kappa | 0.9963 |
| Acc 95% CI lower | 0.9962 |
| Acc 95% CI upper | 0.9978 |
| No Information Rate | 0.2844 |
| P-Value (Acc > NIR) | 0.0000 |
| Mcnemar's Test P-Value | |
Table 2: Accuracy Statistics - Final Model
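The accuracy and no-information rate in Table 2 follow directly from the confusion matrix; a minimal base-R sketch on a small hypothetical 2-class table (not the model's actual output):

```r
# Hypothetical 2-class confusion matrix: rows are predictions, columns the
# reference labels. Accuracy is the share of correct predictions; the
# no-information rate is the share of the largest reference class.
cm <- matrix(c(50,  2,
                3, 45), nrow = 2, byrow = TRUE,
             dimnames = list(pred = c("A", "B"), ref = c("A", "B")))
accuracy <- sum(diag(cm)) / sum(cm)      # 95 correct out of 100 -> 0.95
nir      <- max(colSums(cm)) / sum(cm)   # class A share: 53/100 -> 0.53
```

An accuracy well above the no-information rate (as in Table 2: 0.9971 vs. 0.2844) is what the P-Value (Acc > NIR) row tests.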
| | Sensitivity | Specificity | PPV | NPV | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|---|---|---|---|---|---|---|---|---|
| Class: A | 0.999 | 0.999 | 0.998 | 1.000 | 0.284 | 0.284 | 0.285 | 0.999 |
| Class: B | 0.997 | 0.999 | 0.997 | 0.999 | 0.194 | 0.193 | 0.194 | 0.998 |
| Class: C | 0.996 | 0.998 | 0.993 | 0.999 | 0.174 | 0.174 | 0.175 | 0.997 |
| Class: D | 0.993 | 1.000 | 0.998 | 0.999 | 0.164 | 0.163 | 0.163 | 0.996 |
| Class: E | 0.998 | 1.000 | 0.999 | 1.000 | 0.184 | 0.183 | 0.184 | 0.999 |
Table 3: Statistics by Class - Final Model
# Predict class outcome with final model
predict(finalModel, newdata = validation)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
All 20 predictions were correct, but the sample size is too small for any further analysis.
A jitter plot overlaid with a violin shape of the normalised data can be used to assess the distributions of the different variables.
The confusion matrices on the testing data for the first four models.
The prediction error plot of the first four models.
library(caret); library(dplyr); library(ggplot2); library(htmlTable)
library(reshape2); library(viridis); library(doParallel); library(knitr); library(gridExtra)
library(randomForest); library(rfUtilities)
setwd("~/Desktop/PracticalMachineLearning")
data <- read.csv("pml-training.csv", stringsAsFactors = FALSE) # read the data into R
data$classe <- factor(data$classe) # make outcome variable a factor
# Create subset of possible features
df <- data %>%
select(-grep("kurtosis|skewness|max|min|amplitude|var|avg|stddev",names(data), value =F)) %>%
select(-new_window, -cvtd_timestamp, -user_name, -X,
-raw_timestamp_part_1,-raw_timestamp_part_2, -num_window)
#Partitioning the data set
set.seed(1984)
inTrain <- createDataPartition(df$classe, p=0.6, list = FALSE)
training <- df[inTrain,]
testing <- df[-inTrain,]
# Detect linear combinations
trainMatrix <- as.matrix(training[,-53])
findLinearCombos(trainMatrix)$linearCombos
# Detect variables responsible for high pair-wise correlation
corMatrix <- cor(training[,-53]) # calculate correlation matrix
highCors <- findCorrelation(corMatrix, cutoff = .9, # find variables responsible for
names = TRUE, exact = TRUE) # high pair-wise correlation
# Create html table
htmlTable(as.data.frame(highCors), align = "rrrr", header = "Variable Names",
caption = "If those variables are removed, pair-wise correlation
drops below 0.9")
# Create correlation matrix in long format
corMatrix <- melt(abs(cor(training[,-53]))) # bring matrix into long format
ggplot(corMatrix, aes(x = Var1, y = Var2, fill = value)) + # plot heatmap
geom_tile() +
theme_classic() +
scale_fill_viridis(discrete=FALSE) +
theme(axis.text.x = element_text(angle=45, vjust=0.5)) +
ggtitle(expression(atop("Correlation Heatmap",
atop(italic("absolute values"), "")))) +
labs(x="", y="")
nearZeroVar(training) # testing for near zero variance
# Normalise the data
trainNorm <- as.data.frame(scale(training[,-53]))
trainNorm$classe <- training$classe
# Bring normalised data into long format
trainMelt <- melt(trainNorm)
# Plot histograms for each variable, classe colored
ggplot(trainMelt, aes(x = value, fill = classe)) +
scale_x_continuous(limits = c(-4, 4)) +
facet_wrap(~ variable) +
geom_histogram(binwidth = .5) +
theme_classic() +
scale_fill_viridis(discrete=TRUE) +
ggtitle(expression(atop("Histogram",
atop(italic("stacked by classe, normalised data"), "")))) +
labs(x="", y="")
#Fitting 4 RF Models
set.seed(1984)
mod1 <- randomForest(y = training$classe, x = training[,-53], ntree = 200, mtry = sqrt(52),
do.trace = TRUE, xtest = testing[,-53], ytest = testing$classe, keep.forest = T)
mod2 <- randomForest(y = training$classe, x = training[,-53], ntree = 500, mtry = sqrt(52),
do.trace = TRUE, xtest = testing[,-53], ytest = testing$classe, keep.forest = T)
mod3 <- randomForest(y = training$classe, x = training[,-53], ntree = 1000, mtry = sqrt(52),
do.trace = TRUE, xtest = testing[,-53], ytest = testing$classe, keep.forest = T)
mod4 <- randomForest(y = training$classe, x = training[,-53], ntree = 2000, mtry = sqrt(52),
do.trace = TRUE, xtest = testing[,-53], ytest = testing$classe, keep.forest = T)
# Cross-validating those 4 models
set.seed(1984)
mod1CV <- rf.crossValidation(mod1, xdata = training[,-53],
p = 0.05, n = 20, plot = F, do.trace = T)
mod2CV <- rf.crossValidation(mod2, xdata = training[,-53],
p = 0.05, n = 20, plot = F, do.trace = T)
mod3CV <- rf.crossValidation(mod3, xdata = training[,-53],
p = 0.05, n = 20, plot = F, do.trace = T)
mod4CV <- rf.crossValidation(mod4, xdata = training[,-53],
p = 0.05, n = 20, plot = F, do.trace = T)
# If models are already calculated
load("mod1.RData"); load("mod2.RData"); load("mod3.RData"); load("mod4.RData");
load("mod1CV.RData"); load("mod2CV.RData"); load("mod3CV.RData"); load("mod4CV.RData");
# Create a list of models and CV-data
errorList <- list(mod1CV, mod2CV, mod3CV, mod4CV)
modelList <- list(mod1, mod2, mod3, mod4)
# Create a long format data frame with CV and OOB errors for all 4 models
statDf <- melt(data.frame(
mod1CV = mod1CV$Error.distribution,
mod1OOB = mod1CV$OOB.distribution,
mod2CV = mod2CV$Error.distribution,
mod2OOB = mod2CV$OOB.distribution,
mod3CV = mod3CV$Error.distribution,
mod3OOB = mod3CV$OOB.distribution,
mod4CV = mod4CV$Error.distribution,
mod4OOB = mod4CV$OOB.distribution))
# Calculate standard error for confidence intervals (95%)
statDfs <- statDf %>% group_by(variable) %>%
summarise(mean = mean(value),
stdev = sd(value),
se = stdev/sqrt(20),
ci.up = se * qt(0.975, df = 19),
ci.low = se * -qt(0.975, df = 19))
# Name the statistical data frame and prepare for left_join (factor levels)
statDfs <-
cbind(c("model1","model1","model2","model2","model3","model3","model4","model4"),
c("Mean CV Error","OOB Error", "Mean CV Error","OOB Error","Mean CV Error",
"OOB Error","Mean CV Error","OOB Error"),
statDfs)
names(statDfs) <- c("Model", "Error", "Variable", "mean", "stdev", "se", "ci.up", "ci.low")
statDfs$Error <- factor(statDfs$Error, levels = c("Mean CV Error","OOB Error","Test Set Error"))
# Set up empty data frame
errorDf <- data.frame()
# Fill data frame with calculated errors
for (i in 1:4) {
errorDf <- rbind(errorDf,
cbind(
paste0("model",i),
"Mean CV Error",
errorList[[i]]$cv.Summary[4,1]))
errorDf <- rbind(errorDf,
cbind(
paste0("model",i),
"OOB Error",
mean(predict(modelList[[i]]) != training$classe)))
errorDf <- rbind(errorDf,
cbind(
paste0("model",i),
"Test Set Error",
mean(predict(modelList[[i]], newdata = testing ) != testing$classe)))
}
# Process data frame
errorDf$V3 <- as.numeric(as.character(errorDf$V3))
names(errorDf) <- c("Model", "Error", "Value")
# Join it with the statistical data frame
plotDf <- left_join(errorDf, statDfs)
# Create an error bar chart by model with CI error bars where possible
ggplot(plotDf, aes(x = Model, y = Value, fill = Model)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = Value + ci.low, ymax = Value + ci.up), width = .3) +
theme_classic() +
ggtitle(expression(atop("Estimated Model Error Rate",
atop(italic("Model 1-4, 95% CI Error Bars"), "")))) +
scale_fill_viridis(discrete=TRUE)+
theme(axis.text.x = element_text(angle=45, vjust=0.5)) +
facet_wrap(~ Error)
# Train two more models using different mtry values
set.seed(1984)
mod4double <- randomForest(y = training$classe, x = training[,-53], ntree = 2000,
mtry = sqrt(52)*2, do.trace = TRUE, xtest = testing[,-53],
ytest = testing$classe, keep.forest = T)
mod4half <- randomForest(y = training$classe, x = training[,-53], ntree = 2000,
mtry = sqrt(52)*0.5, do.trace = TRUE, xtest = testing[,-53],
ytest = testing$classe, keep.forest = T)
# Cross validate those two models
set.seed(1984)
mod4CVdouble <- rf.crossValidation(mod4double, xdata = training[,-53],
p = 0.05, n = 20, plot = F, do.trace = T)
mod4CVhalf <- rf.crossValidation(mod4half, xdata = training[,-53],
p = 0.05, n = 20, plot = F, do.trace = T)
# If models are already calculated
load("mod4double.RData"); load("mod4half.RData")
load("mod4CVdouble.RData"); load("mod4CVhalf.RData")
# Create a list of models and CV-data
errorList <- list(mod4CV, mod4CVdouble, mod4CVhalf)
modelList <- list(mod4, mod4double, mod4half)
# Create a long format data frame with CV and OOB errors for the 3 models
statDf <- melt(data.frame(
mod4CV = mod4CV$Error.distribution,
mod4OOB = mod4CV$OOB.distribution,
mod4doubleCV = mod4CVdouble$Error.distribution,
mod4doubleOOB = mod4CVdouble$OOB.distribution,
mod4halfCV = mod4CVhalf$Error.distribution,
mod4halfOOB = mod4CVhalf$OOB.distribution))
# Create a statistic data frame for CI error bars calculation
statDfs <- statDf %>% group_by(variable) %>%
summarise(mean = mean(value),
stdev = sd(value),
se = stdev/sqrt(20),
ci.up = se * qt(0.975, df = 19),
ci.low = se * -qt(0.975, df = 19))
# Name and process that data frame
statDfs <-
cbind(c("model4","model4","model4double","model4double",
"model4half","model4half"),
c("Mean CV Error","OOB Error", "Mean CV Error","OOB Error","Mean CV Error",
"OOB Error"),
statDfs)
names(statDfs) <- c("Model", "Error", "Variable", "mean", "stdev", "se", "ci.up", "ci.low")
nam <- c("model4","model4double","model4half")
statDfs$Error <- factor(statDfs$Error, levels = c("Mean CV Error","OOB Error","Test Set Error"))
# Set up empty data frame
errorDf <- data.frame()
# Fill this data frame with the error values for those 3 models
for (i in 1:3) {
errorDf <- rbind(errorDf,
cbind(
nam[i],
"Mean CV Error",
errorList[[i]]$cv.Summary[4,1]))
errorDf <- rbind(errorDf,
cbind(
nam[i],
"OOB Error",
mean(predict(modelList[[i]]) != training$classe)))
errorDf <- rbind(errorDf,
cbind(
nam[i],
"Test Set Error",
mean(predict(modelList[[i]], newdata = testing ) != testing$classe)))
}
# Process and name that data frame
errorDf$V3 <- as.numeric(as.character(errorDf$V3))
names(errorDf) <- c("Model", "Error", "Value")
# Join it with the statistical data frame
plotDf <- left_join(errorDf, statDfs)
# Create an error bar chart by model with CI error bars where possible
ggplot(plotDf, aes(x = Model, y = Value, fill = Model)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = Value + ci.low, ymax = Value + ci.up), width = .3) +
theme_classic() +
ggtitle(expression(atop("Estimated Model Error Rate",
atop(italic("mtry Permutations, 95% CI Error Bars"), "")))) +
scale_fill_viridis(discrete=TRUE)+
theme(axis.text.x = element_text(angle=45, vjust=0.5)) +
facet_wrap(~ Error)
# Train the final model
set.seed(1984)
finalModel <- randomForest(y = df$classe, x = df[,-53], ntree = 2000, mtry = sqrt(52),
do.trace = TRUE, keep.forest = T)
# Cross validate the final model
finalModelCV <- rf.crossValidation(finalModel, xdata = df[,-53],
p = 0.05, n = 50, plot = F, do.trace = T)
# If model is already calculated
load("finalModel.RData"); load("finalModelCV.RData")
# Create a long format data frame with CV and OOB errors for the final model
statDf <- melt(data.frame(
finalModelCV = finalModelCV$Error.distribution,
finalModelOOB = finalModelCV$OOB.distribution))
# Create a statistic data frame for CI error bars calculation
statDfs <- statDf %>% group_by(variable) %>%
summarise(mean = mean(value),
stdev = sd(value),
se = stdev/sqrt(50),
ci.up = se * qt(0.975, df = 49),
ci.low = se * -qt(0.975, df = 49))
# Name and process that data frame
statDfs <- cbind(c("finalModel","finalModel"),
c("Mean CV Error","OOB Error"),
statDfs)
names(statDfs) <- c("Model", "Error", "Variable", "mean", "stdev", "se", "ci.up", "ci.low")
statDfs$Error <- factor(statDfs$Error, levels = c("Mean CV Error","OOB Error"))
# Set up empty data frame
errorDf <- data.frame()
# Fill this data frame with the error values for the final model
errorDf <- rbind(errorDf,
cbind(
"finalModel",
"Mean CV Error",
finalModelCV$cv.Summary[4,1]))
errorDf <- rbind(errorDf,
cbind(
"finalModel",
"OOB Error",
mean(predict(finalModel) != df$classe)))
# Process and name that data frame
errorDf$V3 <- as.numeric(as.character(errorDf$V3))
names(errorDf) <- c("Model", "Error", "Value")
# Join it with the statistical data frame
plotDf <- left_join(errorDf, statDfs)
# Create an error bar chart by model with CI error bars
p1 <- ggplot(plotDf, aes(x = Error, y = Value, fill = Model)) +
geom_bar(stat = "identity", position = "dodge") +
geom_errorbar(aes(ymin = Value + ci.low, ymax = Value + ci.up), width = .2) +
geom_text(aes(label = round(Value,5), x = Error, y = 0.001), color = "grey") +
theme_classic() +
theme(legend.position="none") +
ggtitle(expression(atop("Estimated Final Model Error Rate",
atop(italic("95% CI Error Bars"), "")))) +
scale_fill_viridis(discrete=TRUE)+
theme(axis.text.x = element_text(angle=45, vjust=0.5)) +
ylim(0, 0.007)
# Create OOB confusion matrix plot
# predict without new data on RF object gives the OOB predictions on training data
M <- confusionMatrix(predict(finalModel), df$classe)
p2 <- ggplot(as.data.frame(M$table),
aes(x = Prediction, y = Reference, fill = log(Freq+1))) +
geom_tile() +
geom_text(aes(label = Freq)) +
scale_fill_viridis(discrete=FALSE) +
ggtitle("OOB Confusion Matrix Final Model") +
theme_classic()
# Create data frame for variable importance plot
# Adapted from a function by ramhiser:
# https://gist.github.com/ramhiser/6dec3067f087627a7a85
var_importance <- data_frame(variable=setdiff(colnames(training), "classe"),
importance=as.vector(importance(finalModel)))
var_importance <- arrange(var_importance, desc(importance))
var_importance$variable <- factor(var_importance$variable,
levels=var_importance$variable)
p3 <- ggplot(head(var_importance,10),
aes(x=variable, weight=importance, fill=variable)) +
geom_bar() + coord_flip() +
ggtitle("Variable Importance Final Model") +
xlab("") +
ylab("Variable Importance (Mean Decrease in Gini Index)") +
scale_fill_viridis(discrete = TRUE, name="Variable Name") +
theme_classic() +
theme(axis.text.y=element_blank())
# plot 3 analytic plots on one grid
grid.arrange(p1,p2,p3, layout_matrix = matrix(c(1,3,2,3), ncol =2))
# Table 2
htmlTable(txtRound(as.matrix(M$overall),4),
header = c("Final Model"), rnames = c("Accuracy","Kappa", "Acc 95% CI lower",
"Acc 95% CI upper",
"No Information Rate", "P-Value (Acc > NIR)",
"Mcnemar's Test P-Value"))
# Table 3
htmlTable(txtRound(M$byClass,3), caption = "Final Model",
          header = c("Sensitivity", "Specificity", "PPV", "NPV", "Prevalence",
                     "Detection Rate", "Detection Prevalence", "Balanced Accuracy"),
          align = "rrrrrrrr")
# Read in Validation Data
validation <- read.csv("pml-testing.csv", stringsAsFactors = FALSE)
# Preprocess like the training data
validation <- validation %>%
select(-grep("kurtosis|skewness|max|min|amplitude|var|avg|stddev",names(data), value =F)) %>%
select(-new_window, -cvtd_timestamp, -user_name, -X,
-raw_timestamp_part_1,-raw_timestamp_part_2, -num_window)
# Predict class outcome with final model
valPredictions <- predict(finalModel, newdata = validation)
setwd("~/Desktop/PracticalMachineLearning/Predictions")
# Write-up function
pml_write_files = function(x){
n = length(x)
for(i in 1:n){
filename = paste0("problem_id_",i,".txt")
write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
}
}
# Create files
pml_write_files(valPredictions)